Abstract: The problem of private information "leakage" (inadvertently
or by malicious design) from the myriad large
centralized searchable data repositories drives the need for an 
analytical framework that quantifies unequivocally how safe 
private data can be (privacy) while still providing useful benefit 
(utility) to multiple legitimate information consumers. Rate 
distortion theory is shown to be a natural choice to develop 
such a framework which includes the following: modeling of 
data sources, developing application independent utility and 
privacy metrics, quantifying utility-privacy tradeoffs irrespective 
of the type of data sources or the methods of providing privacy, 
developing a side-information model for dealing with questions of 
external knowledge, and studying a successive disclosure problem 
for multiple query data sources. 

I. Introduction 

Information technology and electronic communications 
have been rapidly applied to almost every sphere of human 
activity, including commerce, medicine and social network- 
ing. The concomitant emergence of myriad large centralized 
searchable data repositories has made "leakage" of private 
information such as medical data, credit card information, or 
social security numbers via data correlation (inadvertently or 
by malicious design) an important and urgent societal problem. 
Unlike the well-studied secrecy problem (e.g., [1]-[3]) in
which the protocols or primitives make a sharp distinction 
between secret and non-secret data, in the privacy problem, 
disclosing data provides informational utility while enabling 
possible loss of privacy at the same time. In fact, in the 
course of a legitimate transaction, a user can learn some public 
information, which is allowed and needs to be supported, and 
at the same time also learn/infer private information, which 
needs to be prevented. Thus every user is (potentially) also 
an adversary. This drives the need for a unified analytical 
framework that can tell us unequivocally and precisely how 
safe private data can be (privacy) while still providing useful 
benefit (utility) to multiple legitimate information consumers. 

It has been noted that utility and privacy are competing 
goals: perfect privacy can be achieved by publishing nothing 
at all, but this has no utility; perfect utility can be obtained 
by publishing the data exactly as received, but this offers 
no privacy [4]. Utility of a data source is potentially (but 
not necessarily) degraded when it is restricted or modified 
to uphold privacy requirements. The central problem of this 
paper is a precise quantification of the tradeoff between the
privacy needs of the respondents (individuals represented by 
the data) and the utility of the sanitized (published) data for 
any data source. 

Though the problem of privacy and information leakage has 
been studied for several decades by multiple research commu- 
nities (e.g., [4]-[8] and the references therein), the proposed 
solutions have been both heuristic and application-specific. 
The recent groundbreaking theory of differential privacy [9], 
[10] from the theoretical computer science community is the 
first universal model that applies to any statistical database 
irrespective of application or content. However, the crucial 
challenges of an analytic characterization of both a utility 
metric and the privacy-utility tradeoff remain unaddressed.
We seek to address these challenges using tools and techniques 
from information theory. 

Rate distortion theory is a natural choice to study the utility- 
privacy tradeoff; utility can be quantified via fidelity which 
in turn is related to distortion and privacy can be quantified 
via equivocation. Our key insight is captured in the following 
theorem we present in this paper: for a data source with private 
and public data and desired utility level, maximum privacy 
for the private data is achieved by minimizing the information 
disclosure rate sufficient to satisfy the desired utility for the 
public data. To the best of our knowledge this is the first 
observation that tightly relates utility and privacy.

In a sparsely referenced paper [11] from three decades ago, 
Yamamoto developed the tradeoff between rate, distortion, 
and equivocation for a specific and simple source model. In 
this paper, we show via the above summarized theorem that 
Yamamoto's formalism can be translated into the language 
of data disclosure. Furthermore, we develop a framework 
which allows us to model data sources, specifically databases, 
develop application independent utility and privacy metrics, 
quantify the fundamental bounds on the utility-privacy trade- 
offs, and develop a side-information model for dealing with 
questions of external knowledge, and study the utility-privacy 
tradeoffs for multiple query data sources as a successive 
disclosure problem. The final problem arises in the following 
context: real-world data sources are in general interactive, that 
is, they allow users multiple interactions (queries). However, 
modeling this analytically is particularly challenging. Our 
framework can handle the non-interactive (single query) case 
for any given utility and privacy requirements. In this paper, we
study the interactive case as a successive disclosure problem
modeled along the information-theoretic successive refinement 
problem. 

II. The Database Privacy Problem 

A. Problem Definition 

While the problem of quantifying the utility-privacy tradeoff
applies to all types of data sources, we start our study with
databases because they are highly structured and historically 
better studied than other types of sources. A database is a 
table (matrix) whose rows represent the individual entries and 
whose columns represent the attributes of each entry [5]. For 
example, the attributes of each entry in a healthcare database 
typically include name, address, social security number (SSN), 
gender, and a collection of medical information and each entry 
contains the information pertaining to an individual. Messages 
from a user to a database are called queries and, in general, 
result in some numeric or non-numeric information from the 
database termed the response. 

The goal of privacy protection is to ensure that, to the 
extent possible, the user's knowledge is not increased beyond 
strict predefined limits by interacting with the database. The 
goal of utility provision is, generally, to maximize the amount 
of information that the user can receive. Depending on the 
relationship between attributes, and the distribution of the 
actual data, a response may contain information that can be 
inferred beyond what is explicitly included in the response. 
The privacy policy defines the information that should not be
revealed explicitly or by inference to the user and depends on
the context and the application. For example, in a database 
on health statistics, attributes such as name and SSN may 
be considered private data, whereas in a state motor vehicles 
database only the SSN is considered private. The challenge 
for privacy protection is to design databases such that any 
response does not reveal information contravening the privacy 
policy. 

B. Current Approaches and Metrics 

The approaches considered in the literature have centered 
on perturbation (also called sanitization) which encompasses 
a general class of database modification techniques that ensure 
that a user only interacts with a modified database that is derived
from the original (e.g., [4], [6]-[8]). Most of the current
perturbation-based approaches are heuristic and application- 
specific and often focus on additive noise approaches. 

Perturbation techniques depend on whether the database is 
considered interactive (i.e., whether the user can issue more
queries after seeing earlier responses) or non-interactive [9].
In the non-interactive model, the database is published after 
a sanitization process in which personal identifiers are elimi- 
nated and the data is perturbed using one of many possible 
input perturbation approaches; alternately in the interactive 
model, the database adds noise to the response based on a 
data model. 



In order to quantify the privacy and utility afforded by a data 
source, metrics are critical. The concept of k-anonymity pro- 
posed by Sweeney [7] captures the intuitive notion of privacy 
that every individual entry should be indistinguishable from 
(k - 1) other entries for some large value of k. More recently,
researchers in the data mining community have proposed to 
quantify the privacy loss resulting from data disclosure as the 
mutual information between attribute values in the original 
and perturbed data sets, both modeled as random variables 
[8]. Finally, motivated by cryptographic models, the concept 
of differential privacy from theoretical computer science [9], 
[10] has created a universal model for privacy which measures 
the risk of loss of privacy to an individual whose data is in 
a statistical database. However, this work, as well as the others
described above, does not propose a companion universal utility
metric that can be guaranteed along with privacy.
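
As an illustration of the k-anonymity notion described above, the following minimal sketch computes the largest k for which a small table is k-anonymous. The table, the attribute names, and the choice of quasi-identifiers are hypothetical and serve only to make the definition concrete; this is not the procedure of [7].

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k such that every row shares its quasi-identifier
    values with at least k - 1 other rows (0 for an empty table)."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical toy table: zip code and age band are treated as quasi-identifiers.
table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "diabetes"},
]
k = k_anonymity(table, ["zip", "age"])
```

Here each (zip, age) combination appears twice, so the table is 2-anonymous with respect to those two attributes.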

C. Privacy vs. Utility 

While the privacy problem has been studied by multiple 
communities using multiple approaches, the companion utility 
problem has not been studied as analytically and exhaustively 
except in the context of specific applications. Indeed, most 
discussions of privacy assume an implicit utility that is left 
unstated or unmeasured. Utility of a data source is, by ne- 
cessity, a relative concept and is measured from the point 
of view of the user: utility is maximal when the user gets 
full information flow and reduces when the flow of certain 
information is reduced either by restriction or the addition 
of noise. The general concept of utility as a measure of the 
approximation to an underlying (but undisclosed) quantity is 
a fertile area of research (e.g., [12], [13]). However, these
measures have not been customized for the context of privacy 
enhancement. Heuristic measures of utility in the context of 
privacy have been proposed (e.g., [4]) but they do not yield
a general notion of utility. In our proposed work, we will 
use a working definition of utility as the measure of the 
distance or divergence (using suitably chosen metrics such 
as Euclidean or Kullback-Leibler divergence) between the 
original and sanitized databases. 
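
To make this working definition concrete, the sketch below measures utility loss as the Kullback-Leibler divergence between the empirical distributions of one attribute in an original and a sanitized database. The data values and the helper names `empirical_pmf` and `kl_divergence` are assumptions introduced for illustration.

```python
import math
from collections import Counter

def empirical_pmf(values, alphabet):
    """Relative frequency of each symbol of the alphabet in the given column."""
    counts = Counter(values)
    n = len(values)
    return {a: counts[a] / n for a in alphabet}

def kl_divergence(p, q):
    """D(p || q) in bits; returns inf when q assigns zero mass where p does not."""
    total = 0.0
    for a, pa in p.items():
        if pa == 0:
            continue
        if q.get(a, 0.0) == 0:
            return math.inf
        total += pa * math.log2(pa / q[a])
    return total

# Hypothetical attribute column before and after sanitization.
original = ["A", "A", "B", "B", "B", "C", "C", "C"]
sanitized = ["A", "B", "B", "B", "B", "C", "C", "A"]
alphabet = ["A", "B", "C"]
p = empirical_pmf(original, alphabet)
q = empirical_pmf(sanitized, alphabet)
divergence = kl_divergence(p, q)
```

A divergence of zero corresponds to maximal utility under this measure, and larger values indicate a sanitized database whose statistics deviate further from the original.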

III. An Information-Theoretic Approach 
A. Model for Databases 

Circumventing the semantic issue: In general, utility and 
privacy metrics tend to be application specific. Focusing our 
efforts on developing an analytical model, we propose to 
capture a canonical database model and representative abstract 
metrics. Such a model will circumvent the classic privacy 
issues related to the semantics of the data by assuming that 
there exist forward and reverse maps of the data set to the 
proposed abstract format (e.g., a string of bits or a sequence
of real values). Such mappings are often implicitly assumed in 
the privacy literature [4], [8], [9]; our motivation for making it 
explicit is to separate the semantic issues from the abstraction 
and apply Shannon-theoretic techniques. 

Model: Our proposed model focuses on large databases with 
K attributes per entry. Let $X_k \in \mathcal{X}_k$ be a random variable
denoting the $k$th attribute, $k = 1, 2, \ldots, K$, and let
$X = (X_1, X_2, \ldots, X_K)$. A database $d$ with $n$ rows is a sequence
of $n$ independent observations of $X$ with the distribution

$$p_X(x) = p_{X_1 X_2 \cdots X_K}(x_1, x_2, \ldots, x_K) \tag{1}$$

which is assumed to be known to the designers of the database. 
Our assumption of row independence in (1) is justified because
correlation in databases is typically across attributes and not
across entries. We write $X^n = (X_1^n, X_2^n, \ldots, X_K^n)$ to denote
the $n$ independent observations of $X$. This database model
is universal in the sense that most practical databases can be 
mapped to this model. 
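
The database model above can be sketched as follows: a database is drawn as n independent rows from a known joint distribution over the attributes. The joint distribution, the attribute interpretation, and the function name `sample_database` are hypothetical and chosen only for illustration.

```python
import random

def sample_database(joint_pmf, n, seed=None):
    """Draw n independent rows from a joint pmf given as {row_tuple: probability}."""
    rng = random.Random(seed)
    rows = list(joint_pmf.keys())
    weights = list(joint_pmf.values())
    return rng.choices(rows, weights=weights, k=n)

# Hypothetical K = 2 example with attributes (smoker, lung_condition).
# The attributes are correlated within a row; rows are independent of each other.
joint_pmf = {
    (0, 0): 0.60,
    (0, 1): 0.10,
    (1, 0): 0.15,
    (1, 1): 0.15,
}
d = sample_database(joint_pmf, n=1000, seed=0)
```

Each element of `d` is one row (entry) of the database, so correlation appears only across the attributes of a row, as assumed in the model.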

The joint distribution in (1) models the fact that the attributes
in general are correlated and can reveal information about one 
another. In addition to the revealed information, a user of a 
database can have access to correlated side information from 
other information sources. We model the side-information as 
an $n$-length sequence $Z^n$ which is correlated with the database
entries via a joint distribution $p_{XZ}(x, z)$.

Public and private variables: We consider a general model 
in which some attributes need to be kept private while the 
source can reveal a function of some or all of the attributes. 
We write $\mathcal{K}_r$ and $\mathcal{K}_h$ to denote the sets of public (subscript $r$
for revealed) and private (subscript $h$ for hidden) attributes,
respectively, such that $\mathcal{K}_r \cup \mathcal{K}_h = \mathcal{K} = \{1, 2, \ldots, K\}$. We
further denote the corresponding collections of public and
private attributes by $X_r = \{X_k\}_{k \in \mathcal{K}_r}$ and $X_h = \{X_k\}_{k \in \mathcal{K}_h}$,
respectively. Our notation allows for an attribute to be both 
public and private; this is to account for the fact that a database 
may need to reveal a function of an attribute while keeping 
the attribute itself private. In general, a database can choose 
to keep public (or private) one or more attributes ($K \geq 1$).
Irrespective of the number of private attributes, a non-zero 
utility results only when the database reveals an appropriate 
function of some or all of its attributes. 

Special cases: For K = 1, the lone attribute of each 
entry (row) is both public and private, and thus, we have 
$X = X_r = X_h$. Such a model is appropriate for data mining
[8] and census [4], [6] data sets in which utility generally is 
achieved by revealing a function of every entry of the database 
while simultaneously ensuring that no entry is completely 
revealed. For $K = 2$, $\mathcal{K}_h \cup \mathcal{K}_r = \mathcal{K}$, and $\mathcal{K}_h \cap \mathcal{K}_r = \emptyset$, we
obtain the Yamamoto model in [11]. 

B. Metrics: The Privacy and Utility Principle 

Even though utility and privacy measures tend to be specific 
to the application, there is a fundamental principle that unifies 
all these measures in the abstract domain. The aim of a 
privacy-preserving database is to provide some measure of 
utility to the user while at the same time guaranteeing a 
measure of privacy for the entries in the database. 

A user perceives the utility of a perturbed database to be 
high as long as the response is similar to the response of the 
original database; thus, the utility is highest for the original
(unperturbed) database and goes to zero when the perturbed
database is completely unrelated to the original database. Accordingly,
our utility metric is an appropriately chosen average
'distance' function between the original and the perturbed 
databases. Privacy, on the other hand, is maximized when the 
perturbed response is completely independent of the data. Our 
privacy metric measures the difficulty of extracting any private 
information from the response, i.e., the amount of uncertainty 
or equivocation about the private attributes given the response. 

C. A Privacy-Utility Tradeoff Model 

We now propose a privacy-utility model for databases. 
Our primary contribution is demonstrating the equivalence 
between the database privacy problem and a source coding 
problem with additional privacy constraints. For our abstract 
universal database model, sanitization is thus a problem of 
mapping a set of database entries to a different set subject to 
specific utility and privacy requirements. Our notation below 
relies on this abstraction. 

Recall that a database $d$ with $n$ rows is an instantiation of
$X^n$. Thus, we will henceforth refer to a real database $d$ as
an input sequence and to the corresponding sanitized database
(SDB) $d'$ as an output sequence. When the user has access to
side information, the reconstructed sequence at the user will 
in general be different from the SDB sequence. 

Our coding scheme consists of an encoder $F_E$ which is
a mapping from the set of all input sequences (i.e., all
databases $d$ picked from an underlying distribution) to a set of
indices $\mathcal{W} = \{1, 2, \ldots, M\}$ and an associated table of output
sequences (each of which is a $d'$) with a one-to-one mapping
to the set of indices given by

$$F_E : \left(\mathcal{X}_1^n \times \mathcal{X}_2^n \times \cdots \times \mathcal{X}_k^n\right)_{k \in \mathcal{K}_{\mathrm{enc}}} \to \mathcal{W} \equiv \{\mathrm{SDB}_k\}_{k=1}^{M} \tag{2}$$

where $\mathcal{K}_r \subseteq \mathcal{K}_{\mathrm{enc}} \subseteq \mathcal{K}$ and $M = 2^{nR}$ is the number of
output (sanitized) sequences created from the set of all input
sequences. The encoding rate $R$ is the number of bits per entry
(without loss of generality, we assume $n$ entries in $d$ and $d'$) of
the sanitized database. The encoding $F_E$ in (2) includes both
public and private attributes in order to model the general case
in which the sanitization depends on a subset of all attributes.

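A user with a view of the SDB (i.e., an index $w \in \mathcal{W}$ for
every $d$) and with access to side information $Z^n$, whose entries
$Z_i$, $i = 1, 2, \ldots, n$, take values in the alphabet $\mathcal{Z}$, reconstructs
the database $d'$ via the mapping

$$F_D : \mathcal{W} \times \mathcal{Z}^n \to \{\hat{x}_{r,m}^n\}_{m=1}^{M} \subseteq \prod_{k \in \mathcal{K}_r} \mathcal{X}_k^n \tag{3}$$

where $\hat{X}_r^n = F_D(F_E(X^n), Z^n)$.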
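
A toy instance of the encoder and decoder maps may help fix ideas. In the sketch below, the encoder returns the index of the nearest entry in a small table of sanitized sequences, and the decoder (without side information) is a table lookup. The codebook, the nearest-codeword rule, and the function names are assumptions made for illustration only; indices are 0-based here, whereas the index set in the text starts at 1.

```python
import math

def hamming_distance(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def encode(d, codebook):
    """Toy F_E: index of the codeword closest to database d in Hamming distance."""
    return min(range(len(codebook)), key=lambda w: hamming_distance(d, codebook[w]))

def decode(w, codebook):
    """Toy F_D without side information: table lookup of the sanitized sequence."""
    return codebook[w]

# Hypothetical codebook of M = 4 sanitized sequences for n = 6 binary entries.
codebook = [
    (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (1, 1, 1, 0, 0, 0),
    (1, 1, 1, 1, 1, 1),
]
n = 6
rate = math.log2(len(codebook)) / n  # R = log2(M) / n bits per entry

d = (1, 1, 0, 0, 0, 0)       # original database (input sequence)
w = encode(d, codebook)      # index seen by the user
d_prime = decode(w, codebook)  # sanitized database (output sequence)
```

With M = 4 and n = 6 the rate is 1/3 bit per entry; the input sequence is mapped to the codeword that differs from it in a single position.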

A database may need to satisfy multiple utility constraints
for different (disjoint) subsets of attributes, and thus, we
consider a general framework with $L \geq 1$ utility functions
that need to be satisfied. Relying on the distance based utility
principle, we model the $l$th utility, $l = 1, 2, \ldots, L$, via the
requirement that the average distortion $\Delta_l$ of a function $f_l$ of
the revealed variables is upper bounded, for some $\epsilon > 0$, as

$$\Delta_l \equiv E\left[\frac{1}{n} \sum_{i=1}^{n} g\left(f_l(X_{r,i}), f_l(\hat{X}_{r,i})\right)\right] \leq D_l + \epsilon, \quad l = 1, 2, \ldots, L, \tag{4}$$

where $g(\cdot, \cdot)$ denotes a distortion function, $E$ is the expectation
over the joint distribution of $(X_r, \hat{X}_r)$, and the
subscript $i$ in $X_{r,i}$ and $\hat{X}_{r,i}$ denotes the $i$th entry of $X_r^n$ and
$\hat{X}_r^n$, respectively. Examples of distance-based distortion functions
include the Euclidean distance for Gaussian distributed
database entries, the Hamming distance for binary input and 
output sequences, and the Kullback-Leibler (K-L) 'distance' 
comparing the input and output distributions. 
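
The per-entry averaging in the utility constraint can be sketched for a single database instance as follows. Note that the constraint in the text is an expectation over all database instantiations, whereas this sketch only evaluates the empirical average for one pair of sequences; the data and the function names are assumptions.

```python
def average_distortion(original, reconstructed, g, f=lambda x: x):
    """Empirical (1/n) * sum_i g(f(x_i), f(xhat_i)) for one database instance."""
    if len(original) != len(reconstructed):
        raise ValueError("databases must have the same number of rows")
    n = len(original)
    return sum(g(f(x), f(xh)) for x, xh in zip(original, reconstructed)) / n

def hamming(a, b):
    """Hamming distortion for discrete entries."""
    return 0 if a == b else 1

def squared_error(a, b):
    """Squared Euclidean distortion for real-valued entries."""
    return (a - b) ** 2

# Hypothetical public attribute column and its reconstruction at the user.
x_r = [1, 0, 1, 1, 0, 0, 1, 0]
x_r_hat = [1, 0, 0, 1, 0, 1, 1, 0]
delta = average_distortion(x_r, x_r_hat, hamming)  # 2 mismatches out of 8 entries
meets_utility = delta <= 0.25                      # compare against a target D
```

The optional argument `f` plays the role of the function of the revealed variables on which the utility is defined; the identity is used by default.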

Having argued that a quantifiable uncertainty captures the 
privacy of a database, we model the uncertainty or equivo- 
cation about the private variables using the entropy function 
as 



$$\Delta_p \equiv \frac{1}{n} H(X_h^n \mid W, Z^n) \geq E - \epsilon, \tag{5}$$



i.e., we require the average number of uncertain bits per 
dimension to be lower bounded by $E$. The case in which side
information is not available at the user is obtained by simply
setting $Z^n = \emptyset$ in (3) and (5). While our general problem
allows separate constraints on the privacy and utility, we show
later that for specific canonical databases (census and data
mining) a constraint on only one of them (utility or privacy)
suffices (see Corollary 6 in Section III-E).
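
As a single-letter illustration of the equivocation metric (one entry, no side information), the sketch below computes the conditional entropy of a private bit given a released value from a joint distribution. Both joint distributions and the function name `conditional_entropy` are hypothetical.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(A | B) in bits for a joint pmf given as {(a, b): probability}."""
    marginal_b = defaultdict(float)
    for (_, b), p in joint.items():
        marginal_b[b] += p
    h = 0.0
    for (a, b), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / marginal_b[b])
    return h

# Hypothetical joint pmfs of a private bit x_h and a released value w.
joint_independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
joint_leaky = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

e_independent = conditional_entropy(joint_independent)  # full uncertainty remains
e_leaky = conditional_entropy(joint_leaky)              # response reveals most of x_h
```

When the response is independent of the private bit, the equivocation equals the full entropy of one bit; a correlated response lowers it, which is the privacy loss the constraint is meant to bound.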

The utility and privacy metrics in (4) and (5), respectively,
capture two aspects of our universal model: a) both represent 
averages by computing the metrics across all database instan- 
tiations d, and b) the metrics bound the average distortion 
and privacy per entry. Thus, as the likelihood of the non- 
typical sequences decreases exponentially with increasing n 
(very large databases), these guarantees apply nearly uniformly 
to all (typical) entries. Our general model also encompasses 
the fact that the exact mapping from the distortion and 
equivocation domains to the utility and privacy domains, 
respectively, can depend on the application domain. We write 
$D \equiv (D_1, D_2, \ldots, D_L)$ and $\Delta \equiv (\Delta_1, \Delta_2, \ldots, \Delta_L)$. Based
on our notation thus far, we define the utility-privacy tradeoff 
region as follows. 

Definition 1: The utility-privacy tradeoff region $\mathcal{T}$ is the
set of all feasible utility-privacy tuples $(D, E)$ for which
there exists a coding scheme $(F_E, F_D)$ given by (2) and
(3), respectively, with parameters $(n, M, \Delta, \Delta_p)$ satisfying the
constraints in (4) and (5).

D. Equivalence of Utility-Privacy and Rate-Distortion- 
Equivocation 

We now present an argument for the equivalence of the 
above utility-privacy tradeoff analysis with a rate-distortion- 
equivocation analysis of the same source. For the database 
source model described here, a classic lossy source coding 
problem is defined as follows. 

Definition 2: The set of tuples $(R, D)$ is said to be feasible
(achievable) if there exists a coding scheme given by (2) and
(3) with parameters $(n, M, \Delta)$ satisfying the constraints in
(4) and a rate constraint

$$M \leq 2^{n(R + \epsilon)}. \tag{6}$$

When an additional privacy constraint in (5) is included,
the source coding problem becomes one of determining the 
achievable rate-distortion-equivocation region defined as fol- 
lows. 

Definition 3: The rate-distortion-equivocation region $\mathcal{R}$ is
the set of all tuples $(R, D, E)$ for which there exists a coding
scheme given by (2) and (3) with parameters $(n, M, \Delta, \Delta_p)$
satisfying the constraints in (4), (5), and (6). The set of all
feasible distortion-equivocation tuples $(D, E)$ is denoted by
$\mathcal{R}_{D-E}$, the equivocation-distortion function in the $D$-$E$ plane
is denoted by $\Gamma(D)$, and the distortion-equivocation function
which quantifies the rate as a function of both $D$ and $E$ is
denoted by $R(D, E)$.

Thus, a rate-distortion-equivocation code is by definition a 
(lossy) source code satisfying a set of distortion constraints 
that achieves a specific privacy level for every choice of the 
distortion tuple. In the following theorem, we present a basic 
result capturing the precise relationship between $\mathcal{T}$ and $\mathcal{R}$. To
the best of our knowledge, this is the first analytical result that 
quantifies a tight relationship between utility and privacy. We 
briefly sketch the proof here; details can be found in [14]. 

Theorem 4: For a database with a set of utility and privacy
metrics, the tightest utility-privacy tradeoff region $\mathcal{T}$ is the
distortion-equivocation region $\mathcal{R}_{D-E}$.

Proof: The crux of our argument is the fact that for any 
feasible utility level $D$, choosing the minimum rate $R(D)$
ensures that the least amount of information is revealed about 
the source via the reconstructed variables. This in turn ensures 
that the maximum privacy of the private attributes is achieved 
for that utility since, in general, the public and private variables 
are correlated. For the same set of utility constraints, since 
such a rate requirement is not a part of the utility-privacy 
model, the resulting privacy achieved is at most as large as 
that in $\mathcal{R}_{D-E}$ (see Fig. 1(a)). ■

Implicit in the above argument is the fact that a utility- 
privacy achieving code does not perform any better than a 
rate-distortion-equivocation code in terms of achieving a lower 
rate (given by $\log_2 M / n$) for the same distortion and privacy
constraints. This is because if such a code exists then we can 
always find an equivalent source coding problem for which 
the code would violate Shannon's source coding theorem 
[15]. An immediate consequence of this is that a distortion- 
constrained source code suffices to preserve a desired level of 
privacy; in other words, the utility constraints require revealing 
data which in turn comes at a certain privacy cost that 
must be borne and vice-versa. We capture this observation 
in Fig. 1(b) where we contrast existing privacy-exclusive and
utility-exclusive regimes (extreme points of the utility-privacy 
tradeoff curve) with our more general approach of determining 
the set of feasible utility-privacy tradeoff points. 

Fig. 1. (a) Rate Distortion Equivocation Region; (b) Utility-Privacy Tradeoff Region.

From an information-theoretic perspective, the power of
Theorem 4 is that it allows us to study the larger problem
of database utility-privacy tradeoffs in terms of a relatively
familiar problem of source coding with privacy constraints. As
noted previously, this problem has been studied for a specific
source model by Yamamoto and here we expand his elegant
analysis to arbitrary database models including those with side
information at the user.



E. Capturing the Effects of Side-Information 

It has been illustrated that when a user has access to an 
external data source (which is not part of the database under 
consideration) the level of privacy that can be guaranteed 
changes [7], [9]. We cast this problem in information-theoretic 
terms as a side information problem. 

In an extended version of this work [14], we have developed 
the tightest utility-privacy tradeoff region for the three cases 
of a) no side information (L = 1 case studied in [11]), b) 
side information only at the user, and c) side information 
at both the source (database) and the user. We present a 
result for the case with side information at the user only 
and for simplicity, we assume a single utility function, i.e., 
L = 1. The proof mimics that of source coding with side 
information in [16] and therefore, involves the use of an 
auxiliary random variable U. The proof also includes bounds 
on the equivocation along the lines of those in [11, Appendix 
1]. The following theorem defines the bounds on the region
$\mathcal{R}$ in Definition 3 via the functions $\Gamma(D)$ and $R(D, E)$ where
$\Gamma(D)$ bounds the maximal achievable privacy and $R(D, E)$
is the minimal information rate (see Fig. 1(a)) for very large
databases ($n \to \infty$). The proof is omitted due to space and
can be found in [14]. 

Theorem 5: For a database with side information available
only at the user, the functions $\Gamma(D)$ and $R(D, E)$ and the
regions $\mathcal{R}_{D-E}$ and $\mathcal{R}$ are given by

$$\Gamma(D) = \sup_{p(x_r, x_h)\, p(u \mid x_r, x_h) \in \mathcal{P}(D)} H(X_h \mid U Z) \tag{7}$$

$$R(D, E) = \inf_{p(x_r, x_h)\, p(u \mid x_r, x_h) \in \mathcal{P}(D, E)} I(X_h X_r; U) - I(Z; U) \tag{8}$$

$$\mathcal{R}_{D-E} = \{(D, E) : D \geq 0,\ 0 \leq E \leq \Gamma(D)\} \tag{9}$$

$$\mathcal{R} = \{(R, D, E) : D \geq 0,\ 0 \leq E \leq \Gamma(D),\ R \geq R(D, E)\} \tag{10}$$

where $\mathcal{P}(D, E)$ is the set of all $p(x_r, x_h, z)\, p(u \mid x_r, x_h)$ such
that $E[d(X_r, g(U, Z))] \leq D$ and $H(X_h \mid U Z) \geq E$, while
$\mathcal{P}(D)$ is defined as

$$\mathcal{P}(D) = \bigcup_{H(X_h \mid X_r Z) \leq E \leq H(X_h \mid Z)} \mathcal{P}(D, E). \tag{11}$$

While Theorem 5 applies to a variety of database models, it
is extremely useful in quantifying the utility-privacy tradeoff 
for the following special cases of interest. 

i) The single database problem (i.e., no side information): 
SDB is revealed. Here, we have $Z = \emptyset$ and $U = \hat{X}_r$, i.e.,
the reconstructed vectors seen by the user are the same as the 
SDB vectors. 

ii) Completely hidden private variables: Privacy is com- 
pletely a function of the statistical relationship between pub- 
lic, private, and side information data. The expression for 
$R(D, E)$ in (8) assumes the most general model of encoding
both the private and the public variables. When the private
variables can only be deduced from the revealed variables, i.e.,
$X_h - X_r - U$ is a Markov chain, the expression for $R(D, E)$ in
(8) will simplify to the Wyner-Ziv source coding formulation
[16], thus clearly demonstrating that the privacy of the hidden 
variables is a function of both the correlation between the 
hidden and revealed variables and the distortion constraint. 

iii) Census and data mining problems without side in- 
formation: Information rate completely determines privacy 
achievable. For $Z = \emptyset$, setting $X_r = X_h = X$ (such that
$U = \hat{X}$), we obtain the census/data mining problem discussed
earlier. With this substitution, from Theorem 5 we have the
maximal achievable equivocation $\Gamma(D) = H(X) - R(D)$,
where now $R(D) = R(D, E)$. Our analysis formalizes the
intuition in [8] for using the mutual information as an estimate 
of the privacy lost. However in contrast to [8] in which the 
underlying perturbation model is an additive noise model, 
we assume a perturbation model most appropriate for the 
input statistics, i.e., the stochastic relationship between the 
output and input variables is chosen to minimize the rate of 
information transfer. This fundamental result is captured in the 
following corollary. 

Corollary 6: For the special case of $K = 1$, i.e., $X_r = X_h = X$,
the utility-privacy problem is completely defined by
a utility constraint since the maximum achievable equivocation 
is directly obtainable from the minimal information transfer 
rate. 
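
As a worked instance of Corollary 6 under an assumed source model, consider a single binary attribute that is Bernoulli(p) with Hamming distortion. The standard rate-distortion function for this source is R(D) = h(p) - h(D) for D below min(p, 1 - p) and zero otherwise, where h is the binary entropy function, so the maximal equivocation is H(X) - R(D). The Bernoulli model and the function names below are illustrative assumptions and do not appear in the original development.

```python
import math

def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p, D):
    """R(D) for a Bernoulli(p) source under Hamming distortion."""
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

def max_equivocation(p, D):
    """Gamma(D) = H(X) - R(D) for the K = 1 case of Corollary 6."""
    return binary_entropy(p) - rate_distortion_bernoulli(p, D)

p = 0.5
# Each tuple is (D, minimal rate, maximal equivocation) for the assumed source.
curve = [(D, rate_distortion_bernoulli(p, D), max_equivocation(p, D))
         for D in (0.0, 0.1, 0.25, 0.5)]
```

At D = 0 the data is published exactly and no equivocation remains, while at D = 0.5 nothing needs to be revealed and the equivocation equals the full entropy; intermediate utility levels trace the tradeoff between these extremes.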

F. A Successive Disclosure Problem 

As mentioned earlier, databases can be broadly categorized 
as non-interactive and interactive depending on whether the 
data is sanitized once before publishing or repeatedly in 
response to each query, respectively. For census and similar 
statistical databases a one-shot sanitization is typical whereas 
for more interactive databases multiple queries can lead to 
multiple sanitizations. 

Single-query model: The model and analysis proposed 
in Sections III-A through III-D capture the non-interactive database
model and the resulting utility-privacy tradeoff region. For this 
one-shot model, sanitization is determined by the choice of 
the utility and privacy metrics defined a priori. In contrast to 
existing approaches that are dominantly focused on additive 
noise perturbations satisfying a large set of queries [17], 
[18], our one-shot approach is independent of queries and 
is designed to satisfy specific utility and privacy constraints. 
Such a model is relevant for databases such as those with 
medical and clinical data that may find repeated uses in the 
future but with queries that cannot be predicted ahead of time 
or which require query-independent strict sanitization prior to 
interaction to ensure regulatory compliance (e.g., US HIPAA 
privacy policies [19]). 

Multiple-query model: For a large majority of data reposito- 
ries, utility is a function of their usage and as such the problem 
of addressing the utility-privacy tradeoffs in a multiple query 
model is imperative. A side-effect of allowing multiple queries 
is that a user can refine her query to learn more information 
at each step, which in turn can lead to privacy breaches. 
Our aim is to determine if a certain level of overall utility 
can be guaranteed while preserving a desired overall privacy 
threshold. In the absence of disclosure controls, a database 
will typically respond to each query independently of the 
previous queries. We seek to develop a model in which the 
database is cognizant of current and past queries in responding 
to future queries. To this end, we assume the existence of 
a data collector that provides an interface for the user to 
submit queries and collate the responses over multiple queries,
a common assumption in the multi-query literature [9], [10],
[18]. For this model, under the assumption that the user wishes 
to obtain a refined view of the source, we propose to determine 
whether a source can be successively disclosed, i.e., whether 
a set of overall utility and privacy constraints can be satisfied 
via multiple disclosures with increasing refinement at each 
stage and without any information loss relative to an equivalent 
single-shot model with the same overall utility and privacy 
constraints. 

This problem of successive disclosure has a natural rela- 
tionship to a problem of successive refinement in information 
theory, which pertains to determining whether successively 
revealing data from a source with decreasing distortion at each 
stage can ensure no rate loss relative to a one-shot approach 
with the same final distortion [20]-[22]. We demonstrate this 
analogy in Fig. 2 where, at the first stage, the user obtains
a specific view (denoted $\hat{X}_1$ of a source $X$) of the source
which in conjunction with the second stage provides a final
refined view $\hat{X}_2$. While the successive refinement problem is
to determine whether $R_2 = R(D_2)$, the successive disclosure
problem is that of determining whether $R_2 = R(D_2, E_2)$
where $D_2 < D_1$ and $E_2 < E_1$. As with the successive
refinement problem, our results can help determine the condi- 
tions and relationships between the input and output sequences 
under which a source can be disclosed successively. 

Analogous to successive refinement, we start by studying a 
multiple disclosure problem in which we seek to determine 
the rates $R_0$ and $R_1$ at which the database responds with
distortion (utility) and privacy levels $(D_0, E_0)$ and $(D_1, E_1)$
to two queries, respectively, such that a user using both query
responses can reconstruct a response at a distortion-privacy
level of $(D_2, E_2)$. Analogous to the relationship between
multiple description and successive refinement, the successive
disclosure problem described here is a special case of the
multiple disclosure problem for which there is no rate loss,
i.e., $R_1 = R(D_1, E_1)$ and $R_0 + R_1 = R(D_2, E_2)$.

While a detailed analysis of this problem can be found in 
an extended version of this work [14], we now present two 
example privacy problems for which the successive refinement 
problem presents immediate insights on the effects of refined 
disclosure. The two problems are privacy preservation in 
census and data mining databases, and in both cases, we 
briefly argue that the successive disclosure problem simplifies 
to the successive refinement problem. Recall that in Corollary
6 we showed that the census and data mining problems are
special cases for which the rate-distortion-equivocation region 
is directly obtainable from the rate-distortion curve because for 
both problems the public and the private variables are the same 
as a result of which the maximum achievable equivocation 
is directly obtainable from the rate-distortion function. The 
following theorem summarizes our result. 

Fig. 2. Successive Refinement and Successive Disclosure Problems.

Theorem 7: For $K = 1$ databases, successive disclosure
with distortion-privacy pairs $(D_1, E_1)$ and $(D_2, E_2)$ is
achievable if and only if there exists a conditional distribution
$p(\hat{x}_1, \hat{x}_2 \mid x)$ with

$$E\left[g(X, \hat{X}_k)\right] \leq D_k, \quad k = 1, 2,$$

such that

$$R(D_k, E_k) = I(X; \hat{X}_k), \quad k = 1, 2,$$

and $X - \hat{X}_2 - \hat{X}_1$ form a Markov chain, i.e.,

$$p(\hat{x}_1, \hat{x}_2 \mid x) = p(\hat{x}_2 \mid x)\, p(\hat{x}_1 \mid \hat{x}_2).$$
Thus, for these two special but fundamentally important
problems, we can show that the Markov condition $X - \hat{X}_2 - \hat{X}_1$
(see Fig. 2) required for successive refinement [20, Theorem
2] also holds here and in fact suffices to satisfy the successive
disclosure requirement of no additional rate or privacy leakage.
More work is needed to address questions such as the
practical implications of the above Markov condition [21] and
generalizing the solution to arbitrary sources.
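
The Markov condition can be checked numerically for a given joint distribution of the source, the refined view, and the coarse view. The sketch below builds a hypothetical cascade of two binary symmetric channels (source to refined view, refined view to coarse view) and verifies the factorization of Theorem 7; the channel parameters and function names are assumptions for illustration, and the check says nothing about whether the rate conditions of the theorem are also met.

```python
from itertools import product

def bsc(eps):
    """Transition probabilities of a binary symmetric channel with crossover eps."""
    return {(a, b): (1 - eps if a == b else eps) for a, b in product((0, 1), repeat=2)}

def cascade_joint(p_x, eps2, eps12):
    """Joint pmf p(x, x2, x1) when X -> X2 -> X1 is a cascade of two BSCs."""
    c2, c1 = bsc(eps2), bsc(eps12)
    return {(x, x2, x1): p_x[x] * c2[(x, x2)] * c1[(x2, x1)]
            for x, x2, x1 in product((0, 1), repeat=3)}

def is_markov_chain(joint, tol=1e-12):
    """Check p(x1, x2 | x) = p(x2 | x) p(x1 | x2) on every triple in the dict.
    Triples absent from the dict are treated as having probability zero."""
    p_x, p_x2, p_xx2, p_x2x1 = {}, {}, {}, {}
    for (x, x2, x1), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_x2[x2] = p_x2.get(x2, 0.0) + p
        p_xx2[(x, x2)] = p_xx2.get((x, x2), 0.0) + p
        p_x2x1[(x2, x1)] = p_x2x1.get((x2, x1), 0.0) + p
    for (x, x2, x1), p in joint.items():
        if p_x[x] == 0 or p_x2[x2] == 0:
            continue
        lhs = p / p_x[x]
        rhs = (p_xx2[(x, x2)] / p_x[x]) * (p_x2x1[(x2, x1)] / p_x2[x2])
        if abs(lhs - rhs) > tol:
            return False
    return True

joint = cascade_joint({0: 0.5, 1: 0.5}, eps2=0.1, eps12=0.2)
chain_holds = is_markov_chain(joint)

# Counterexample: the coarse view copies x directly, bypassing the refined view.
broken = {(x, x2, x): 0.5 * bsc(0.1)[(x, x2)] for x, x2 in product((0, 1), repeat=2)}
chain_broken = is_markov_chain(broken)
```

In the cascade the coarse view depends on the source only through the refined view, so the factorization holds; in the counterexample the coarse view carries information about the source that the refined view does not, and the check fails.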

IV. Concluding Remarks 

We have presented an abstract model for databases with an 
arbitrary number of public and private variables, developed 
application-independent privacy and utility metrics, used rate 
distortion theory to determine the fundamental utility-privacy 
tradeoff limits, and introduced a successive disclosure problem 
to study utility-privacy tradeoffs and determine the conditions 
for no privacy loss for multiple query data sources. Future 
work includes generalizing the results to distributed data 
sources and relating current approaches in computer science 
and our universal approach. 